Abstract

In this paper, we describe a randomized Shellsort algorithm. This algorithm is a simple, randomized,
data-oblivious version of the Shellsort algorithm that always runs in $O(n \log n)$ time and succeeds in
sorting any given input permutation with very high probability. Taken together, these properties imply
applications in the design of new efficient privacy-preserving computations based on the secure multi-party
computation (SMC) paradigm. In addition, by a trivial conversion of this Monte Carlo algorithm
to its Las Vegas equivalent, one gets the first version of Shellsort with a running time that is provably
$O(n \log n)$ with very high probability.

1 Introduction 

July 2009 marked the 50th anniversary of the Shellsort algorithm [37]. This well-known sorting algorithm
(which should always be capitalized, since it is named after its inventor) is simple to implement. Given a
sequence of offset values, $(o_1, o_2, \ldots, o_p)$, with each $o_i < n$, and an unsorted array $A$, whose $n$ elements are
indexed from 0 to $n - 1$, the Shellsort algorithm (in its traditional form) is as follows:

for i = 1 to p do
    for j = 0 to o_i - 1 do
        Sort the subarray of A consisting of indices j, j + o_i, j + 2o_i, ..., e.g., using insertion-sort.
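As a minimal sketch of the traditional scheme above, the following Python code runs one insertion-sort per offset and starting index. The function names and the halving offset sequence produced by `halving_offsets` are illustrative assumptions, not part of the paper; any decreasing sequence ending in 1 could be supplied instead.

```python
def shellsort(a, offsets):
    """Sort list `a` in place using the given decreasing offset sequence."""
    n = len(a)
    for o in offsets:                      # for i = 1 to p
        for j in range(min(o, n)):         # for j = 0 to o_i - 1
            # Insertion-sort the subarray a[j], a[j + o], a[j + 2o], ...
            for k in range(j + o, n, o):
                x = a[k]
                t = k
                while t >= o and a[t - o] > x:
                    a[t] = a[t - o]
                    t -= o
                a[t] = x
    return a


def halving_offsets(n):
    """Hypothetical helper: the offsets n//2, n//4, ..., 1."""
    offsets = []
    o = n // 2
    while o >= 1:
        offsets.append(o)
        o //= 2
    return offsets
```

Because the last offset is 1, the final pass is an ordinary insertion-sort, so this version always sorts; only its running time depends on the offset sequence.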

In fact, even this traditional version of Shellsort is actually a family of algorithms, since there are so many
different offset sequences. The trick in implementing a traditional version of Shellsort, therefore, is coming
up with a good offset sequence. Pratt [35] shows that using a sequence consisting of all products of powers
of 2 and 3 less than $n$ results in a worst-case running time of $O(n \log^2 n)$. Several other offset sequences have
been studied (e.g., see the excellent survey of Sedgewick [11]), but none beat the asymptotic performance
of the Pratt sequence. Moreover, Plaxton and Suel [37] establish a lower bound of $\Omega(n \log^2 n/(\log\log n)^2)$
for the worst-case running time of Shellsort with any input sequence (see also [13]) and Jiang et al. [25]
establish a lower bound of $\Omega(p n^{1+1/p})$ for the average-case running time of Shellsort. Thus, the only way to
achieve an $O(n \log n)$ average-time bound for Shellsort is to use an offset sequence of length $\Theta(\log n)$, and,
even then, the problem of proving an $O(n \log n)$ average running-time bound for a version of Shellsort is a
long-standing open problem [44].

The approach we take in this paper is to consider a variant of Shellsort where the offset sequence is a
fixed sequence, $(o_1, o_2, \ldots, o_p)$, of length $O(\log n)$ (indeed, we just use powers of two), but the "jumps" for
each offset value in iteration $i$ are determined from a random permutation between two adjacent regions in
$A$ of size $o_i$ (starting at indices that are multiples of $o_i$). The standard Shellsort algorithm is equivalent
to using the identity permutation between such region pairs, so it is appropriate to consider this to be a
randomized variant of Shellsort.

In addition to variations in the offset sequence and how it is used, there are other existing variations to
Shellsort, which are based on replacing the insertion-sort in the inner loop with other actions. For instance,
Dobosiewicz [14] proposes replacing the insertion-sort with a single linear-time bubble-sort pass (a
left-to-right sequence of compare-exchanges between elements at offset-distances apart), which will work
correctly, for example, with the Pratt offset sequence, and which seems to work well in practice for geometric
offset sequences with ratios less than 1.33 [2]. Incerpi and Sedgewick [M] [23] study a version of Shellsort
that replaces the insertion-sort by a shaker pass (see also [21 HE]). This is a left-to-right bubble-sort pass
followed by a right-to-left bubble-sort pass, and it also seems to do better in practice for geometric offset
sequences [24]. Yet another modification of Shellsort replaces the insertion-sort with a brick pass, which
is a sequence of odd-even compare-exchanges followed by a sequence of even-odd compare-exchanges [33].
While these variants perform well in practice, we are not aware of any average-case analysis for any of these
variants of Shellsort that proves they have an expected running time of $O(n \log n)$. Sanders and Fleischer [41]
describe an algorithm they call "randomized Shellsort," which is a data-dependent Shellsort algorithm as in
the above pseudo-code description, except that it uses products of random numbers as its offset sequence.
They don't prove an $O(n \log n)$ average-time bound for this version, but they do provide some promising
empirical data to support an average running time near $O(n \log n)$; see also [33].

1.1 Data-Oblivious Sorting 

In addition to its simplicity, one of the interesting properties of Shellsort is that many of its variants are
data-oblivious. Specifically, if we view compare-exchange operations as a reliable$^1$ primitive (i.e., as a "black
box"), then Shellsort algorithms with bubble-sort passes, shaker passes, brick passes, or any combination of
such sequences of data-independent compare-exchange operations, will perform no operations that depend on
the relative order of the elements in the input array. Such data-oblivious algorithms have several advantages,
as we discuss below.
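To make the data-oblivious property concrete, here is a small Python sketch, under the assumption that compare-exchange is the only primitive that inspects data. The optional `trace` argument is a hypothetical addition for illustration: it records which index pairs are touched, and that record is the same for every input of a given length.

```python
def compare_exchange(a, i, j, trace=None):
    """Put the smaller of a[i], a[j] at index i (with i < j); log the pair touched."""
    if trace is not None:
        trace.append((i, j))
    if a[i] > a[j]:
        a[i], a[j] = a[j], a[i]


def bubble_pass(a, o, trace=None):
    """One left-to-right pass of compare-exchanges between elements o apart."""
    for i in range(len(a) - o):
        compare_exchange(a, i, i + o, trace)
```

Running `bubble_pass` on two different arrays of the same length yields identical traces, which is the sense in which the pass performs no operations that depend on the relative order of the inputs.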

A data-oblivious algorithm for sorting a set of $n$ items can alternatively be viewed as a sorting network [27],
where the elements in the input array are provided as values given on $n$ input wires and internal
gates are compare-exchanges. Ajtai, Komlós, and Szemerédi (AKS) [1] show that one can achieve a sorting
network with $O(n \log n)$ compare-exchange gates in the worst case, but their method is quite complicated
and has a very large constant factor, even with subsequent improvements [SSIHS]. Leighton and Plaxton [5S]
describe a randomized method for building a data-oblivious sorting network that uses $O(n \log n)$
compare-exchange gates and sorts any given input array with very high probability. Unfortunately, even though the
Leighton-Plaxton sorting algorithm is simpler than the AKS sorting network, it is nonetheless considered by
some not to be simple in an absolute sense (e.g., see [44]).

One can also simulate other parallel sorting algorithms or network routing methods, but these don't 
lead to simple time-optimal data-oblivious sequential sorting algorithms. For example, the online routing 
method of Arora et al. [2] is time-optimal but not data-oblivious, as are the PRAM sorting algorithms of 
Shavit et al. [46], Cole [11], Reif [40], and Goodrich and Kosaraju [20]. The shear-sort algorithm of Scherson 
and Sen [42] is simple and data-oblivious but not time-optimal. The columnsort algorithm of Leighton [28] 
and the sorting method of Maggs and Vocking [3D] are asymptotically fast, but they both employ the AKS 
network; hence, they are not simple. 

Finally, note that well-known time-optimal sorting algorithms, such as radix-sort, quicksort, heapsort,
and mergesort (e.g., see O [TH] HU EH]), are not data-oblivious. In addition, well-known data-oblivious
sorting algorithms, such as odd-even mergesort and Batcher's bitonic sort (e.g., see [27]), as well as Pratt's
version of Shellsort [35], run in $O(n \log^2 n)$ time. Therefore, existing sorting algorithms arguably do not
provide a simple data-oblivious sorting algorithm that runs in $O(n \log n)$ time and succeeds with very high
probability for any given input permutation.

1.1.1 Modern Motivations for Simple Data-Oblivious Sorting

Originally, data-oblivious sorting algorithms were motivated primarily by their ability to be implemented in
special-purpose hardware modules [26]. Interestingly, however, there is a new, developing set of applications
for data-oblivious sorting algorithms in information security and privacy.

In secure multi-party computation (SMC) protocols (e.g., see [SJ El HI 12U EH]), two or more parties
separately hold different portions of a set of data values, $\{x_1, x_2, \ldots, x_n\}$, and are interested in computing
some function, $f(x_1, x_2, \ldots, x_n)$, on these values. In addition, due to privacy concerns, none of the different
parties is willing to reveal the specific values of his or her pieces of data. SMC protocols allow the parties
to compute the value of $f$ on their collective input values without revealing any of their specific data values
(other than what can be inferred from the output function, $f$, itself [19]).

$^1$ We assume throughout this paper that compare-exchange operations always operate correctly; readers interested in
fault-tolerant sorting should see, e.g., [3lf8lll7|.

One of the main tools for building SMC protocols is to encode the function $f$ as a circuit and then
simulate an evaluation of this circuit using digitally-masked values, as in the Fairplay system [SJ 03]. By
then unmasking only the output value(s), the different parties can learn the value of $f$ without revealing
any of their own data values. Unfortunately, from a practical standpoint, SMC systems like Fairplay suffer 
from a major efficiency problem, since encoding entire computations as circuits can involve significant blow- 
ups in space (and simulation time). These blow-ups can be managed more efficiently, however, by using 
data-oblivious algorithms to drive SMC computations where only the primitive operations (such as MIN, 
MAX, AND, ADD, or compare-exchange) are implemented as simulated circuits. That is, each time such 
an operation is encountered in such a computation, the parties perform an SMC computation to compute 
its masked value, with the rest of the steps of the algorithm performed in an oblivious way. Thus, for a 
problem like sorting, which in turn can be used to generate random permutations, in a privacy-preserving 
way, being able to implement the high-level logic in a data-oblivious manner implies that simulating only 
the low-level primitives using SMC protocols will reveal no additional information about the input values. 
This zero-additional-knowledge condition follows from the fact that data-oblivious algorithms use their low- 
level primitive operations in ways that don't depend on input data values. Therefore, we would like to 
have a simple data-oblivious sorting algorithm, so as to drive efficient SMC protocols that use sorting as a 
subroutine. 

1.2 Our Results 

In this paper, we present a simple, data-oblivious randomized version of Shellsort, which always runs in
$O(n \log n)$ time and sorts with very high probability. In particular, the probability that it fails to sort any
given input permutation will be shown to be at most $1/n^b$, for constant $b \ge 1$, which is the standard for
"very high probability" (v.h.p.) that we use throughout this paper.

Although this algorithm is quite simple, our analysis that it succeeds with very high probability is not. 
Our proof of probabilistic correctness uses a number of different techniques, including iterated Chernoff 
bounds, the method of bounded average differences for Doob martingales, and a probabilistic version of 
the zero-one principle. Our analysis also depends on insights into how this randomized Shellsort method 
brings an input permutation into sorted order, including a characterization of the sortedness of the sequence 
in terms of "zones of order." We bound the degree of zero-one unsortedness, or dirtiness, using three 
probabilistic lemmas and an inductive argument showing that the dirtiness distribution during the execution 
of our randomized Shellsort algorithm has exponential tails with polylogarithmic dirtiness at their ends, with 
very high probability (w.v.h.p.). We establish the necessary claims by showing that the region compare-exchange
operation simultaneously provides three different kinds of near-sortedness, which we refer to as
"leveraged-splitters." We show that, as the algorithm progresses, these leveraged-splitters cause
the central region, where zeroes and ones meet, to become progressively cleaner, while the rest of the array
remains very clean, so that, in the end, the array becomes sorted, w.v.h.p.

In addition to this theoretical analysis, we also provide a Java implementation of our algorithm, together 
with some experimental results. 

As a data-oblivious algorithm, our randomized Shellsort method is a Monte Carlo algorithm (e.g., see [34,
35]), in that it always runs in the same amount of time but can sometimes fail to sort. It can easily be
converted into a data-dependent Las Vegas algorithm, however, which always succeeds but has a randomized
running time, by testing if its output is sorted and repeating the algorithm if it is not. Such a data-dependent
version of randomized Shellsort would run in $O(n \log n)$ time with very high probability; hence, it would
provide the first version of Shellsort that provably runs in $O(n \log n)$ time with very high probability.
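The Monte Carlo to Las Vegas conversion described above can be sketched as a simple retry loop. In this Python sketch, `monte_carlo_sort` stands for any sorter that may fail (its name and its `(list, rng)` signature are assumptions made for illustration); the wrapper is data-dependent because the number of repetitions depends on the outcome of the sortedness test.

```python
import random


def is_sorted(a):
    """Return True if the list is in nondecreasing order."""
    return all(a[i] <= a[i + 1] for i in range(len(a) - 1))


def las_vegas_sort(a, monte_carlo_sort, rng=None):
    """Repeat a Monte Carlo sorter until its output passes a sortedness test.

    Returns the sorted list and the number of attempts that were needed.
    """
    rng = rng or random.Random()
    out = list(a)
    attempts = 0
    while True:
        attempts += 1
        out = monte_carlo_sort(out, rng)
        if is_sorted(out):
            return out, attempts
```

If each attempt fails with probability at most $1/n^b$, then the expected number of attempts is close to one, which is why the running time remains $O(n \log n)$ with very high probability.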

2 Randomized Shellsort 

In this section, we describe our randomized Shellsort algorithm. As we show in the sections that follow,
this algorithm always runs in $O(n \log n)$ time and is highly likely to succeed in sorting any given input
permutation.

Suppose that we are given an $n$-element array, $A$, that we wish to sort, where we assume, without loss
of generality, that $n$ is a power of 2. Our randomized Shellsort algorithm uses a geometrically decreasing
sequence of offsets, $O = \{n/2, n/4, n/8, \ldots, 1\}$. For each offset, $o \in O$, we number consecutive regions in $A$ of
length $o$, as 0, 1, 2, etc., with each starting at an index that is a multiple of $o$, so that region 0 is $A[0 \,..\, o-1]$,
region 1 is $A[o \,..\, 2o-1]$, and so on. We compare pairs of regions according to a schedule that first involves
comparing regions by a shaker pass, that is, an increasing sequence of adjacent-region comparisons followed by a
decreasing sequence of adjacent-region comparisons. We then perform an extended brick pass, where we
compare regions that are 3 offsets apart, then regions 2 offsets apart, and finally those that are odd-even
adjacent and then those that are even-odd adjacent. We refer to this entire schedule as a shaker-brick
pass, since it consists of a shaker pass followed by a type of brick pass.

2.1 Region Compare-Exchange Operations 

Each time we compare two regions, say $A_1$ and $A_2$, of $A$, of size $o \in O$ each, we form $c$ independent random
matchings of the elements in $A_1$ and $A_2$, for a constant $c \ge 1$, which is determined in the analysis. For
each such matching, we perform a compare-exchange operation between each pair of elements in $A_1$ and
$A_2$ that are matched, in an iterative fashion through the $c$ matchings. We refer to this collective set of
compare-exchanges as a region compare-exchange. (See Figure 1.)


Figure 1: The region compare-exchange operation. Connections are shown between pairs of regions colored 
white and their neighboring regions colored gray, under (a) the identity permutation and (b) a random 
permutation for each pair. 
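A region compare-exchange can be sketched in a few lines of Python. In this sketch the function name, the argument order, and the use of `random.Random.shuffle` to draw each matching are illustrative assumptions; the left region is assumed to start at index `s` and the right region at index `t`, with `s < t`.

```python
import random


def region_compare_exchange(a, s, t, o, c, rng):
    """Apply c random matchings between a[s:s+o] and a[t:t+o], with s < t.

    After each compare-exchange, the smaller element of the matched pair
    sits in the left region and the larger one in the right region.
    """
    for _ in range(c):
        perm = list(range(o))
        rng.shuffle(perm)              # a uniformly random matching
        for i in range(o):
            lo, hi = s + i, t + perm[i]
            if a[lo] > a[hi]:
                a[lo], a[hi] = a[hi], a[lo]
```

Replacing the shuffled `perm` with the identity permutation gives the connections of panel (a) in Figure 1, which correspond to the standard Shellsort comparisons.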

for o = n/2, n/2^2, n/2^3, ..., 1 do
    Let A_i denote subarray A[io .. io + o - 1], for i = 0, 1, 2, ..., n/o - 1.
    do a shaker pass:
        Region compare-exchange A_i and A_{i+1}, for i = 0, 1, 2, ..., n/o - 2.
        Region compare-exchange A_{i+1} and A_i, for i = n/o - 2, ..., 2, 1, 0.
    do an extended brick pass:
        Region compare-exchange A_i and A_{i+3}, for i = 0, 1, 2, ..., n/o - 4.
        Region compare-exchange A_i and A_{i+2}, for i = 0, 1, 2, ..., n/o - 3.
        Region compare-exchange A_i and A_{i+1}, for even i = 0, 1, 2, ..., n/o - 2.
        Region compare-exchange A_i and A_{i+1}, for odd i = 0, 1, 2, ..., n/o - 2.

Figure 2: A pseudo-code description of our randomized Shellsort algorithm.

2.2 The Core Algorithm

A pseudo-code description of our randomized Shellsort algorithm, which assumes $n$ is a power of two, is as
shown in Figure 2. (We also provide a complete Java implementation in Figure 5 in Section 5.)

Clearly, the description of our randomized Shellsort algorithm shows that it runs in $O(n \log n)$ time, since
we perform $O(n)$ compare-exchange operations in each of $\log n$ iterations.
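The shaker-brick schedule of the core algorithm can be sketched in Python as follows. This is an illustrative sketch rather than the Java implementation referred to above: the function names are hypothetical, the default `c=4` is an assumed value for the constant number of matchings, the cleanup phase is omitted, and the input length is assumed to be a power of two.

```python
import random


def compare_regions(a, s, t, o, c, rng):
    """Region compare-exchange between a[s:s+o] and a[t:t+o], with s < t."""
    for _ in range(c):
        mate = list(range(o))
        rng.shuffle(mate)                       # one random matching
        for i in range(o):
            if a[s + i] > a[t + mate[i]]:       # keep the smaller value on the left
                a[s + i], a[t + mate[i]] = a[t + mate[i]], a[s + i]


def randomized_shellsort_core(a, c=4, rng=None):
    """Core passes only (no cleanup phase); len(a) is assumed to be a power of two."""
    rng = rng or random.Random()
    n = len(a)
    o = n // 2
    while o >= 1:
        r = n // o                              # number of regions of size o
        pairs = []
        pairs += [(i, i + 1) for i in range(r - 1)]            # shaker pass, going up
        pairs += [(i, i + 1) for i in range(r - 2, -1, -1)]    # shaker pass, going down
        pairs += [(i, i + 3) for i in range(r - 3)]            # extended brick pass
        pairs += [(i, i + 2) for i in range(r - 2)]
        pairs += [(i, i + 1) for i in range(0, r - 1, 2)]      # even i
        pairs += [(i, i + 1) for i in range(1, r - 1, 2)]      # odd i
        for i, j in pairs:
            compare_regions(a, i * o, j * o, o, c, rng)
        o //= 2
    return a
```

Each offset contributes a constant number of region compare-exchanges per region, and each of those costs $c \cdot o$ compare-exchanges, so every iteration of the outer loop performs a number of compare-exchanges linear in $n$.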



2.3 Adding a Cleanup Phase 

Even though the above randomized Shellsort algorithm works, as is, in practice (e.g., see Section 5), we make
a minor addition to the core algorithm here for the sake of proving a high-probability bound. In particular,
we add a cleanup postprocessing phase at the end of the core algorithm that takes care of any stray elements
that are out of place, provided there are not too many such elements. This modification is probably an
artifact of our analysis, not the algorithm itself, but it is nevertheless helpful in proving a high-probability
bound.

Define an $n$-element array, $A$, to be $m$-near-sorted if all but $m$ of the $n$ elements in $A$ are in sorted
order. A $p$-sorter [4, 5] is a deterministic sorting algorithm that can sort a subarray of size $p$ as an atomic
action. Suppose $S$ is a data-oblivious (deterministic) $2m$-sorter that runs in $T(m)$ time. Define an $S$-shaker
pass over $A$ to consist of a use of $S$ at positions that are multiples of $m$ going up $A$ and then down. That
is, an $S$-shaker pass is defined as follows:



for i = 0 to n - 2m incrementing by steps of m do
    Use S to sort A[i .. i + 2m - 1].
for i = n - 2m to 0 decrementing by steps of m do
    Use S to sort A[i .. i + 2m - 1].



To show that this method sorts an $m$-near-sorted array $A$, we make use of the zero-one principle for
sorting networks (which also applies to data-oblivious sorting algorithms):

Theorem 2.1 (Knuth [26]) A sorting network (or data-oblivious sorting algorithm) correctly sorts all
sequences of arbitrary inputs if and only if it correctly sorts all sequences of 0-1 inputs.
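The zero-one principle suggests a simple way to test a small, fixed comparator network: try all $2^n$ binary inputs instead of all $n!$ permutations. The following Python sketch does this; the representation of a network as a list of index pairs `(i, j)` with `i < j` is an assumption made for illustration.

```python
from itertools import product


def apply_network(network, values):
    """Run a list of comparators (i, j), with i < j, over a copy of `values`."""
    a = list(values)
    for i, j in network:
        if a[i] > a[j]:
            a[i], a[j] = a[j], a[i]
    return a


def sorts_all_zero_one_inputs(network, n):
    """Check the network on every 0-1 input of length n."""
    return all(
        apply_network(network, bits) == sorted(bits)
        for bits in product((0, 1), repeat=n)
    )
```

For example, the five-comparator network `[(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]` on four wires passes this check, whereas dropping its last comparator makes the check fail on the input `(0, 1, 0, 1)`.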

The main idea behind this principle is that it allows us to reduce each case of distinguishing the $k$ largest
elements and the $n - k$ smallest elements to an instance having $k$ ones and $n - k$ zeroes. This allows us to
easily prove the following:

Lemma 2.2 Given an $m$-near-sorted array $A$ of size $n$, and a $2m$-sorter $S$, running in $T(m)$ time, an
$S$-shaker pass over $A$ will sort $A$ in $O(T(m)n/m)$ time.

Proof: Suppose $A$ is an $m$-near-sorted binary array, consisting of $k$ ones and $n-k$ zeroes. Thus, there are at
most $m$ ones below position $n-k$ in $A$ and at most $m$ zeroes after this position in $A$. Since it sorts subarrays
of size $2m$ in an overlapping way, the forward loop in an $S$-shaker pass will move up all the lower-order ones
so that there are no ones before position $n - k - m'$, where $m'$ is the number of high-order zeroes. Thus,
since $m' \le m$, the backward loop in an $S$-shaker pass will move down all high-order zeroes so that there are
no zeroes after position $n-k$. ■
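An $S$-shaker pass can be sketched in Python as below. For simplicity, this sketch uses Python's built-in `sorted` as the $2m$-sorter, which is an assumption made only for illustration (it is not data-oblivious); any data-oblivious sorter for blocks of size $2m$ could be passed as `sorter` instead. The sketch also assumes that $m$ divides $n$ and that $n \ge 2m$.

```python
def s_shaker_pass(a, m, sorter=sorted):
    """Apply a 2m-sorter at multiples of m going up the array, then back down."""
    n = len(a)
    starts = range(0, n - 2 * m + 1, m)
    for i in starts:                         # forward loop
        a[i:i + 2 * m] = sorter(a[i:i + 2 * m])
    for i in reversed(starts):               # backward loop
        a[i:i + 2 * m] = sorter(a[i:i + 2 * m])
    return a
```

On the 2-near-sorted array obtained from $0, 1, \ldots, 15$ by swapping its first and last elements, the forward loop carries 15 to the end and the backward loop carries 0 to the front, as in the proof above.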

We show below that the randomized Shellsort, as described in Section 2, will $\alpha\,\mathrm{polylog}(n)$-near-sort an
input array $A$, with very high probability, for some constant $\alpha > 0$. We can then use Pratt's version [38] of
(deterministic) Shellsort as a $2\alpha\,\mathrm{polylog}(n)$-sorter, $S$, in an $S$-shaker postprocessing pass over $A$, which will
run in $O(n(\log\log n)^2)$ time and (by Lemma 2.2) will complete the sorting of $A$. Note, in addition, that
since we are using a Shellsort implementation in an $S$-shaker (Shellsort-type) pass, adding this postprocessing
phase to our randomized Shellsort algorithm keeps the entire algorithm a data-oblivious variant of the
Shellsort algorithm.

3 Analyzing the Region Compare-Exchange Operations



Let us now turn to the analysis of the ways in which region compare-exchange operations bring two regions 
in our size-n input array closer to a near-sorted order. We begin with a definition. 

3.1 Leveraged Splitters 

Ajtai, Komlós, and Szemerédi [1] define a $\lambda$-halver of a sequence of $2N$ elements to be an operation that,
for any $k \le N$, results in a sequence so that at most $\lambda k$ of the largest $k$ elements from the sequence are
in the first $k$ positions, and at most $\lambda k$ of the smallest $k$ elements are in the last $k$ positions. We define a
related notion of a $(\mu, \alpha, \beta)$-leveraged-splitter to be an operation such that, for the $k \le (1-\epsilon)N$ largest
(resp., smallest) elements, where $0 \le \epsilon < 1$, the operation returns a sequence with at most

$$\max\{\alpha (1-\epsilon)^{\mu} N,\ \beta\}$$

of the $k$ largest (smallest) elements on the left (right) half. Thus, a $\lambda$-halver is automatically a $(1, \lambda, 0)$-leveraged-splitter,
but the reverse implication is not necessarily true. The primary advantage of the leveraged-splitter
concept is that it captures the way that $c$ random matchings with compare-exchanges have a modest
impact with respect to a roughly equal number of largest and smallest elements, but a geometric
impact with respect to an imbalanced number of largest and smallest elements. We show below that a region
compare-exchange operation consisting of at least an appropriate constant number of random matchings is,
with very high probability, a $(\mu, \alpha, \beta)$-leveraged-splitter for each of the following sets of parameters:

    $\mu = c+1$, $\alpha = 1/2$, and $\beta = 0$,
    $\mu = c+1$, $\alpha = (2e)^c$, and $\beta = 4e \log n$,
    $\mu = 0$, $\alpha = 1/6$, and $\beta = 0$,
    $\mu = c+1$, $\alpha = 1/2$, and $\beta = N/20$.
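To see how the leveraged-splitter bound behaves for given parameters, the following Python sketch evaluates $\max\{\alpha(1-\epsilon)^{\mu} N, \beta\}$ and counts how many of the $k$ largest elements of a concrete sequence lie in its left half. Both helper names are hypothetical and serve only to illustrate the definition.

```python
def splitter_bound(mu, alpha, beta, eps, N):
    """The bound max{alpha * (1 - eps)^mu * N, beta} from the definition."""
    return max(alpha * (1 - eps) ** mu * N, beta)


def largest_on_left(seq, k):
    """Count how many of the k largest elements of seq lie in its left half."""
    N = len(seq) // 2
    by_value = sorted(range(len(seq)), key=lambda i: seq[i])
    top = by_value[-k:] if k else []
    return sum(1 for i in top if i < N)
```

With $\mu = 0$ the bound does not depend on $\epsilon$ at all, whereas with $\mu = c + 1$ it shrinks geometrically as $(1 - \epsilon)$ decreases, which is the "leverage" the definition is meant to capture.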



The fact that the single region compare-exchange operation is a $(\mu, \alpha, \beta)$-leveraged-splitter for each of these
different sets of parameters, $\mu$, $\alpha$, and $\beta$, allows us to reason about vastly divergent degrees of sortedness
of the different areas in our array as the algorithm progresses. For instance, we use the following lemma
to reason about regions whose sortedness we wish to characterize in terms of roughly equal numbers of
smallest and largest elements.

Lemma 3.1 Suppose a $(0, \lambda, 0)$-leveraged-splitter is applied to a sequence of $2N$ elements, and let $(1-\epsilon)N \le
k \le (1+\epsilon)N$ and $l = 2N - k$, where $0 < \lambda < 1$ and $0 \le \epsilon < 1$. Then at most $(\lambda+\epsilon)N$ of the $k$ largest
elements end up in the left half of the sequence and at most $(\lambda+\epsilon)N$ of the $l$ smallest elements end up in
the right half of the sequence.

Proof: Let us consider the $k$ largest elements, such that $(1-\epsilon)N \le k \le N$. After applying a $(0, \lambda, 0)$-leveraged-splitter,
there are at most $\lambda N$ of the $k$ largest elements on the left half of the sequence (under
this assumption about $k$). Then there are at least $N - \lambda N$ of the $l$ smallest elements on the left half;
hence, at most $l - (N - \lambda N)$ of the $l$ smallest elements on the right half. Therefore, since $l \le N + \epsilon N$, there are at most
$N + \epsilon N - (N - \lambda N) = (\lambda+\epsilon)N$ of the $l$ smallest elements on the right half. A similar argument applies to
the case when $N < k \le (1+\epsilon)N$ to establish an upper bound of at most $(\lambda+\epsilon)N$ of the $k$ largest elements
on the left half. ■

Let us now turn to the proofs that region compare-exchange operations are $(\mu, \alpha, \beta)$-leveraged-splitters,
for each of the sets of parameters listed above.

3.2 The (c + 1, 1/2, 0)-Leveraged-Splitter Property

We begin with the $(c+1, 1/2, 0)$-leveraged-splitter property. So suppose $A_1$ and $A_2$ are two regions of size
$N$ each that are being processed in a region compare-exchange operation consisting of $c$ random matchings.
We wish to show that, for the $k \le (1-\epsilon)N$ largest (resp., smallest) elements, this operation returns a
sequence such that there are at most

$$\frac{(1-\epsilon)^{c+1} N}{2}$$

of the $k$ largest (smallest) elements on the left (right) half. Without loss of generality, let us focus on the
$k$ largest elements. Furthermore, let us focus on the case where the largest $k$ elements are all ones and the
$2N - k$ smallest elements are all zeroes, since a region compare-exchange operation is oblivious. That is, if
a region compare-exchange operation is a $(c+1, 1/2, 0)$-leveraged-splitter in the zero-one case, then it is a
$(c+1, 1/2, 0)$-leveraged-splitter in the general case as well.

Lemma 3.2 Suppose we are given two regions, $A_1$ and $A_2$, each of size $N$, and let $k = k_1 + k_2$, where $k_1$
(resp., $k_2$) is the number of ones in $A_1$ (resp., $A_2$). Let $k_1^{(1)}$ be the number of ones in $A_1$ after a single region
compare-exchange matching. Then

$$E\bigl(k_1^{(1)}\bigr) = \frac{k_1 k_2}{N}.$$

Proof: In order for a one to remain on the left side after a region compare-exchange matching, it must be
matched with a one on the right side. The probability that a one on the left is matched with a one on the
right is $k_2/N$. ■
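The expectation in Lemma 3.2 can be checked empirically with a short simulation. The following Python sketch, with hypothetical function names, builds two zero-one regions, applies one random matching, and averages the number of ones that remain on the left; the estimate should be close to the product of the two counts of ones divided by $N$.

```python
import random


def ones_left_after_matching(k1, k2, N, rng):
    """Number of ones left in A1 after one random matching with A2."""
    left = [1] * k1 + [0] * (N - k1)
    right = [1] * k2 + [0] * (N - k2)
    rng.shuffle(right)                 # random matching: left[i] is paired with right[i]
    # A one on the left stays on the left only when its partner is also a one.
    return sum(1 for x, y in zip(left, right) if x == 1 and y == 1)


def estimate_expected_ones(k1, k2, N, trials, rng):
    """Average of ones_left_after_matching over independent trials."""
    total = sum(ones_left_after_matching(k1, k2, N, rng) for _ in range(trials))
    return total / trials
```

For instance, with $N = 100$, 30 ones on the left, and 40 ones on the right, the estimate should be near $30 \cdot 40 / 100 = 12$.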

We use the above lemma in the proof of the following, which applies to the case when $\epsilon$ is relatively large
(that is, when $(1-\epsilon)$ is relatively small).

Lemma 3.3 (Fast-Depletion Lemma) Given two binary regions, $A_1$ and $A_2$, each of size $N$, let $k =
k_1 + k_2$, where $k_1$ and $k_2$ are the respective number of ones in $A_1$ and $A_2$, and suppose $k \le (1-\epsilon)N$, for
$1/4 \le \epsilon < 1$. Let $k_1^{(2c-1)}$ be the number of ones in $A_1$ after $2c-1$ random matchings (with compare-exchanges) in a
region compare-exchange operation. Then

$$\Pr\left(k_1^{(2c-1)} > \frac{(1-\epsilon)^{c+1} N}{2}\right) \le (2c-1)\, e^{-(1-\epsilon)^{c+1} N/2^{10}}.$$

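Proof: The proof is by induction on the number, $c$, of random matchings. By a theorem of Hoeffding [22],
the expected value of any convex function of the size of a sample without replacement is bounded by the expected value
of that function applied to the size of a similar sample with replacement. Thus, we can apply a Chernoff
bound (e.g., see [34, 35]) to this single random matching and pairwise set of compare-exchange operations.
Note that, for the base case,

$$E\bigl(k_1^{(1)}\bigr) \le k_1\bigl((1-\epsilon) - k_1/N\bigr),$$

which is maximized for $k_1 = (1-\epsilon)N/2$. Thus,

$$E\bigl(k_1^{(1)}\bigr) \le \frac{(1-\epsilon)^2 N}{4}.$$

Therefore, by a well-known Chernoff bound (e.g., see [34, 35]),

$$\Pr\left(k_1^{(1)} > \frac{(1-\epsilon)^2 N}{2}\right) \le \left(\frac{e}{4}\right)^{(1-\epsilon)^2 N/4} \le e^{-(1-\epsilon)^2 N/2^{10}},$$

which establishes the base case.

For the inductive case, $c \ge 2$, let us assume inductively that

$$\Pr\left(k_1^{(2c-3)} > \frac{(1-\epsilon)^{c} N}{2}\right) \le (2c-3)\, e^{-(1-\epsilon)^{c} N/2^{10}}.$$

Recall that

$$E\bigl(k_1^{(2c-2)}\bigr) \le k_1^{(2c-3)}\bigl((1-\epsilon) - k_1^{(2c-3)}/N\bigr),$$

which is maximized, in this case, for $k_1^{(2c-3)} = (1-\epsilon)^c N/2$, since $x(1-x)$ is a monotonic function for
$x \in [0, 1/2]$. Thus,

$$E\bigl(k_1^{(2c-2)}\bigr) \le \frac{(1-\epsilon)^{c} N}{2}\,(1-\epsilon) \le \frac{3(1-\epsilon)^{c} N}{8},$$

since $(1-\epsilon) \le 3/4$. Therefore, by a well-known Chernoff bound,

$$\Pr\left(k_1^{(2c-2)} > \frac{7(1-\epsilon)^{c} N}{16}\right) = \Pr\left(k_1^{(2c-2)} > \left(1+\frac{1}{6}\right)\frac{3(1-\epsilon)^{c} N}{8}\right) \le \left(\frac{e^{1/6}}{(7/6)^{7/6}}\right)^{3(1-\epsilon)^{c} N/8} \le e^{-(1-\epsilon)^{c} N/2^{9}}.$$

So, let us assume now that

$$k_1^{(2c-2)} \le \frac{7(1-\epsilon)^{c} N}{16},$$

and consider one more random matching. Note that, in this case,

$$E\bigl(k_1^{(2c-1)}\bigr) \le \frac{7(1-\epsilon)^{c+1} N}{16}.$$

Therefore, by a well-known Chernoff bound,

$$\Pr\left(k_1^{(2c-1)} > \frac{(1-\epsilon)^{c+1} N}{2}\right) = \Pr\left(k_1^{(2c-1)} > \left(1+\frac{1}{7}\right)\frac{7(1-\epsilon)^{c+1} N}{16}\right) \le \left(\frac{e^{1/7}}{(8/7)^{8/7}}\right)^{7(1-\epsilon)^{c+1} N/16} \le e^{-(1-\epsilon)^{c+1} N/2^{10}}.$$

Combining all the failure probabilities, as in a union bound, then, establishes the lemma. ■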
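The two Chernoff steps in the inductive case use deviations of $1/6$ and $1/7$ above means proportional to $3/8$ and $7/16$ of the relevant quantity, and end with exponents scaled by $2^{-9}$ and $2^{-10}$. As a numerical sanity check, the following Python sketch assumes the multiplicative Chernoff form $\Pr(X > (1+\delta)\mu) \le \bigl(e^{\delta}/(1+\delta)^{1+\delta}\bigr)^{\mu}$ and computes the resulting exponent rates; the function names are hypothetical.

```python
import math


def chernoff_log_base(delta):
    """Natural log of e^delta / (1 + delta)^(1 + delta)."""
    return delta - (1 + delta) * math.log1p(delta)


def exponent_rate(delta, mu_coeff):
    """Return r such that the Chernoff bound equals exp(-r * X) when mu = mu_coeff * X."""
    return -chernoff_log_base(delta) * mu_coeff
```

Here `exponent_rate(1/6, 3/8)` is roughly 0.0049 and `exponent_rate(1/7, 7/16)` is roughly 0.0043, both comfortably larger than $2^{-9} \approx 0.00195$ and $2^{-10} \approx 0.00098$, so the simplified exponential bounds are valid, if not tight.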

By a symmetrical argument, we have a similar result for the case of $k \le (1-\epsilon)N$ zeroes that would wind
up in $A_2$ after $c$ random matchings (with compare-exchange operations between the matched pairs). Thus,
we have the following.



Corollary 3.4 If $A_1$ and $A_2$ are two regions of size $N$ each, then a compare-exchange operation consisting
of $2c-1$ random matchings (with compare-exchanges between matched pairs) between $A_1$ and $A_2$ is a
$(c+1, 1/2, 0)$-leveraged-splitter with probability at least $1 - (2c-1)\,e^{-(1-\epsilon)^{c+1} N/2^{10}}$.



The above lemma and corollary are most useful for cases when the regions are large enough so that the
above failure probability is below $O(n^{-\alpha})$, for $\alpha \ge 1$.



3.3 The (c + 1, (2e)^c, 4e log n)-Leveraged-Splitter Property

When region sizes or $(1-\epsilon)$ values are too small for Corollary 3.4 to hold, we can use the $(c+1, (2e)^c, 4e \log n)$-leveraged-splitter
property of the region-compare operation, where $e$ denotes the base of the natural logarithm. As above, we prove this property by assuming,
without loss of generality, that we are operating on a zero-one array and by focusing on the $k$ largest elements,
that is, the ones. We also note that this particular $(\mu, \alpha, \beta)$-leveraged-splitter property is only useful when
$(1-\epsilon) \le 1/(2e)$, when considering the $k \le (1-\epsilon)N$ largest elements (i.e., the ones), so we add this as a
condition as well.

Lemma 3.5 (Little-Region Lemma) Given two regions, $A_1$ and $A_2$, each of size $N$, let $k = k_1 + k_2$,
where $k_1$ and $k_2$ are the respective number of ones in $A_1$ and $A_2$. Suppose $k \le (1-\epsilon)N$, where $\epsilon$ satisfies
$(1-\epsilon) \le 1/(2e)$. Let $k_1^{(c)}$ be the number of ones in $A_1$ after $c$ random matchings (with compare-exchanges) in a
region compare-exchange operation. Then

$$\Pr\left(k_1^{(c)} > \max\{(2e)^{c} (1-\epsilon)^{c+1} N,\ 4e \log n\}\right) \le c\, n^{-4}.$$

Proof: Let us apply an induction argument, based on the number, $c'$, of random matches in a region compare-exchange
operation. Consider an inductive claim, which states that after $c'$ random matchings (with
compare-exchange operations),

$$k_1^{(c')} > \max\{(2e)^{c'} (1-\epsilon)^{c'+1} N,\ 4e \log n\},$$

with probability at most $c' n^{-4}$. Thus, with high probability, $k_1^{(c')}$ is bounded by the formula on the right-hand
side. The claim is clearly true by assumption for $c' = 0$. So, suppose the claim is true for $c'$, and let us
consider $c' + 1$. Since there is at most a $(1-\epsilon)$ fraction of ones in $A_2$,

$$\mu = E\bigl(k_1^{(c'+1)}\bigr) \le (2e)^{c'} (1-\epsilon)^{c'+2} N.$$

Moreover, the value $k_1^{(c'+1)}$ can be viewed as the number of ones in a sample without replacement from $A_2$
of size $k_1^{(c')}$. By a theorem of Hoeffding [22], then, the expected value of any convex function of the size of
such a sample is bounded by the expected value of that function applied to the size of a similar sample with
replacement. Thus, we can apply a Chernoff bound (e.g., see [34, 35]) to this single random matching and
pairwise set of compare-exchange operations, to derive

$$\Pr\left(k_1^{(c'+1)} > (1+\delta)\mu\right) < 2^{-\delta\mu},$$

provided $\delta \ge 2e - 1$. Taking $(1+\delta)\mu = 2eM(N)$ implies $\delta \ge 2e - 1$, where $M(N) = (2e)^{c'}(1-\epsilon)^{c'+2} N$;
hence, we can bound

$$\Pr\left(k_1^{(c'+1)} > 2eM(N)\right) < 2^{-(2eM(N) - \mu)} \le 2^{-2M(N)},$$

which also gives us a new bound on $M(N)$ for the next step in the induction. Provided $M(N) \ge 2\log n$,
then this (failure) condition holds with probability less than $n^{-4}$. If, on the other hand, $M(N) < 2\log n$,
then

$$\Pr\left(k_1^{(c'+1)} > 4e\log n\right) < 2^{-(4e\log n - \mu)} \le 2^{-2e\log n} < n^{-4}.$$

In this latter case, we can terminate the induction, since repeated applications of the region compare-exchange
operation can only improve things. Otherwise, we continue the induction. At some point during
the induction, we must either reach $c' + 1 = c$, at which point the inductive hypothesis implies the lemma,
or we will have $M(N) < 2\log n$, which puts us into the above second case and implies the lemma. ■

A similar argument applies to the case of the k smallest elements, which gives us the following. 

Corollary 3.6 If $A_1$ and $A_2$ are two regions of size $N$ each, then a compare-exchange operation consisting
of $c$ random matchings (with compare-exchanges between matched pairs) between $A_1$ and $A_2$ is a
$(c+1, (2e)^c, 4e\log n)$-leveraged-splitter with probability at least $1 - c\,n^{-4}$.

As we noted above, this corollary is only of use for the case when $(1-\epsilon) \le 1/(2e)$, where $\epsilon$ is the same
parameter as used in the definition of a $(\mu, \alpha, \beta)$-leveraged-splitter.

3.4 The (0, 1/6, 0)-Leveraged-Splitter Property 

The final property we prove is for the (0, 1/6, 0)-leveraged-splitter property. As with the other two properties, 
we consider here the k largest elements, and focus on the case of a zero-one array. 

Lemma 3.7 (Startup Lemma) Given two regions, $A_1$ and $A_2$, each of size $N$, let $k = k_1 + k_2$, where
$k_1$ and $k_2$ are the respective number of ones in $A_1$ and $A_2$, and $k \le N$. Let $k_1^{(c)}$ be the number of ones in
$A_1$ after $c$ random matchings (with compare-exchanges) in a region compare-exchange operation. Then
$k_1^{(4)} \le N/6$, with very high probability, provided $N$ is $\Omega(\ln n)$.
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Proof: The proof involves four consecutive applications of a theorem of Hoeffding 22 that the expected 
value of any convex function of the size of a sample without replacement is bounded by the expected value 
of that function applied to the size of a similar sample with replacement. Thus, we can apply a Chernoff 
bound (e.g., see [Ml [35]) to each single random matching and pairwise set of compare-exchange operations, 
to derive bounds on failure probabilities. As noted above, 



since fc x (1 — ki/N) is maximized at k\ = N/2. Thus, by a well-known Chernoff bound (e.g., see 



3 J V V 3 / 4 / V(4/3) 4/3 

So let us assume fc x < N/3. Since fei (1 — ki/N) is monotonic on [0, N/2], 

Thus, by another application of a Chernoff bound, 



(2) 

So let us assume k\ < N/4. Since k\ (1 — ki/N) is monotonic on [0, N/2] 



E ( k[ 3) )< - I 1 



' l 



N / 1\ 37V 



4 V 4 7 16 



Thus, again applying a Chernoff bound, 



So let us assume that k\ < AT/5. Thus, since k\ (1 — k\/N) is monotonic on [0,iV/2], 



, , . , , N ( \\ AN 

e kr j < — i 



5 V 5/ 25 



Thus, by one more application of a Chernoff bound, 



ey V V 24 y 25 y - V(25/24p/ 24 

The proof follows by summing these four failure probabilities and the fact that N is 57 (hm). ■ 
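The behavior described by the Startup Lemma can also be observed empirically. The following Java sketch is an illustration under stated assumptions, not part of the original analysis: the names StartupLemmaSketch, randomMatching, and onesLeftAfter are hypothetical, java.util.Random stands in for the random source, and the two regions are plain zero-one arrays. It applies c random matchings between the regions and returns the number of ones left in the first region, which can be compared against N/6 for c = 4.

```java
import java.util.Random;

class StartupLemmaSketch {
    // One random matching between two zero-one regions of equal length:
    // cell i of a1 is matched with cell mate[i] of a2, and the pair is
    // exchanged when a1 holds a one and a2 holds a zero (ones move right).
    static void randomMatching(int[] a1, int[] a2, Random rand) {
        int n = a1.length;
        int[] mate = new int[n];
        for (int i = 0; i < n; i++) mate[i] = i;
        for (int i = 0; i < n; i++) {            // uniform random permutation
            int r = i + rand.nextInt(n - i);
            int t = mate[i]; mate[i] = mate[r]; mate[r] = t;
        }
        for (int i = 0; i < n; i++) {
            if (a1[i] > a2[mate[i]]) {
                int t = a1[i]; a1[i] = a2[mate[i]]; a2[mate[i]] = t;
            }
        }
    }

    // Starts with k1 ones in A1 and k2 ones in A2 (each region has `size` cells),
    // applies c random matchings, and returns the number of ones left in A1.
    static int onesLeftAfter(int size, int k1, int k2, int c, long seed) {
        int[] a1 = new int[size], a2 = new int[size];
        for (int i = 0; i < k1; i++) a1[i] = 1;
        for (int i = 0; i < k2; i++) a2[i] = 1;
        Random rand = new Random(seed);
        for (int round = 0; round < c; round++) randomMatching(a1, a2, rand);
        int ones = 0;
        for (int v : a1) ones += v;
        return ones;
    }

    public static void main(String[] args) {
        int size = 1 << 14;
        int left = onesLeftAfter(size, size / 2, size / 2, 4, 42L);
        System.out.println("ones left in A1: " + left + "  (N/6 = " + size / 6 + ")");
    }
}
```

The starting point $k_1 = k_2 = N/2$ is chosen here because it maximizes the first expectation bound in the proof; other splits with $k \le N$ can be tried by changing the arguments.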

Of course, the above lemma has an obvious symmetric version that applies to the number of zeroes on
the right side of two regions in a region compare-exchange. Thus, we have the following.

Corollary 3.8 If $A_1$ and $A_2$ are two regions of size $N$ each, then a compare-exchange operation consisting of at least 4 random matchings (with compare-exchanges between matched pairs) between $A_1$ and $A_2$ is a $(0, 1/6, 0)$-leveraged-splitter with probability at least $1 - cn^{-4}$, provided $N$ is $\Omega(\ln n)$.

3.5 The (c + 1, 1/2, N/20)-Leveraged-Splitter Property

Finally, let us consider one additional lemma characterizing region compare-exchange operations. This one is most useful in contexts where $\epsilon$ is relatively small.

Lemma 3.9 (Slow-Depletion Lemma) Given two regions, $A_1$ and $A_2$, each of size $N$, let $k = k_1 + k_2 \le (1-\epsilon)N$, where $k_1$ and $k_2$ are the respective number of ones in $A_1$ and $A_2$, and $k \le N$ and $0 < \epsilon < 1$. Let $k_1^{(c)}$ be the number of ones in $A_1$ after $c$ random matches in a compare-exchange operation. Then

\[ \Pr\left(k_1^{(c)} > \max\left\{(1-\epsilon)^{c+1}N/2,\ N/20\right\}\right) < c\,e^{-(1-\epsilon)^{c+1}N/2^{12}}. \]



Proof: The proof is by induction on $c$. For the base case, note that if $k_1^{(0)} = k_1 \le N/20$, then we are done, since random compare-exchanges between $A_1$ and $A_2$ can only improve the number of ones in $A_1$. Furthermore, if $(1-\epsilon)N/2 \le N/20$, then we are guaranteed to have $k_1^{(1)} \le N/20$, since the only way a one can stay in $A_1$ is if it is matched with a one in $A_2$. So suppose $k_1 > N/20$ and $(1-\epsilon)N/2 > N/20$. In this case, recall that

\[ E(k_1^{(1)}) \le \frac{(1-\epsilon)^2 N}{4}. \]

Thus, combining the theorem of Hoeffding [55] that the expected value of any convex function of the size of a sample without replacement is bounded by the expected value of that function applied to the size of a similar sample with replacement, and a well-known Chernoff bound,

\[ \Pr\left(k_1^{(1)} > \frac{(1-\epsilon)^2 N}{2}\right) \le \left(\frac{e}{4}\right)^{(1-\epsilon)^2 N/4} \le e^{-(1-\epsilon)^2 N/2^{12}}. \]

For the inductive step, for $c \ge 2$, let us assume the lemma is true for $c - 1$ random matchings and that $(1-\epsilon)^{c-1}N/2 > N/20$. Let us consider the case when previous steps succeeded, in which case, since $x(1-x)$ is monotonic for $x \in [0, 1/2]$,

\[ E(k_1^{(c)}) \le \frac{(1-\epsilon)^{c+1}N}{2}\left(1 - \frac{(1-\epsilon)^{c-1}}{2}\right) \le \frac{(1-\epsilon)^{c+1}N}{2}\left(\frac{19}{20}\right). \]

Thus, by a well-known Chernoff bound,

\[ \Pr\left(k_1^{(c)} > \frac{(1-\epsilon)^{c+1}N}{2}\right) = \Pr\left(k_1^{(c)} > \left(\frac{20}{19}\right)\left(\frac{19}{20}\right)\frac{(1-\epsilon)^{c+1}N}{2}\right) < \left(\frac{e^{1/19}}{(20/19)^{20/19}}\right)^{19(1-\epsilon)^{c+1}N/40} < e^{-(1-\epsilon)^{c+1}N/2^{12}}. \]

If, after $c$ random matchings, $k_1^{(c)} \le N/20$ or $(1-\epsilon)^c N/2 \le N/20$, then we are done, w.v.h.p. Thus, either we satisfy the condition of the lemma or we can continue the induction. Therefore, the proof follows by summing all the failure probabilities. ■

This implies the following. 

Corollary 3.10 If $A_1$ and $A_2$ are two regions of size $N$ each, then a compare-exchange operation consisting of $c$ random matchings (with compare-exchanges between matched pairs) between $A_1$ and $A_2$ is a $(c+1, 1/2, N/20)$-leveraged-splitter with probability at least $1 - c\,e^{-(1-\epsilon)^{c+1}N/2^{12}}$.

4 Analyzing the Core Algorithm



Having proven the essential properties of a region compare-exchange operation, consisting of c random 
matchings (with compare-exchanges between matched pairs), we now turn to the problem of analyzing the 
core part of our randomized Shellsort algorithm. 

4.1 A Probabilistic Zero-One Principle 

We begin our analysis with a probabilistic version of the zero-one principle. 

Lemma 4.1 If a randomized data-oblivious sorting algorithm sorts any binary array of size $n$ with failure probability at most $\epsilon$, then it sorts any arbitrary array of size $n$ with failure probability at most $\epsilon(n+1)$.

Proof: The lemma² follows from the proof of Theorem 3.3 by Rajasekaran and Sen [33], which itself is based on the justification of Knuth [26] for the deterministic version of the zero-one principle for sorting networks. The essential fact is that an arbitrary $n$-element input array, $A$, has, via monotonic bijections, at most $n + 1$ corresponding $n$-length binary arrays, such that $A$ is sorted correctly by a data-oblivious algorithm, $\mathcal{A}$, if and only if every bijective binary array is sorted correctly by $\mathcal{A}$. (See Rajasekaran and Sen [39] or Knuth [26] for the proof of this fact.) ■
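To make the count of $n + 1$ binary arrays concrete, the following Java sketch builds the threshold images of an input array. It is an illustration only: the class and method names are hypothetical, and breaking ties by position is an assumption made here so that inputs with repeated values still yield exactly $n + 1$ arrays. For each threshold $t$ in $0, \ldots, n$, the image marks with a one the $n - t$ largest entries of the input.

```java
import java.util.Arrays;

class ZeroOneImages {
    // Returns n + 1 binary arrays; images[t][i] is 1 exactly when a[i] is among
    // the n - t largest entries of a (ties are broken by position).
    static int[][] binaryImages(int[] a) {
        int n = a.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        // Sort positions by value, then by index, to obtain a rank for each cell.
        Arrays.sort(order, (x, y) ->
            a[x] != a[y] ? Integer.compare(a[x], a[y]) : Integer.compare(x, y));
        int[] rank = new int[n];
        for (int r = 0; r < n; r++) rank[order[r]] = r;
        int[][] images = new int[n + 1][n];
        for (int t = 0; t <= n; t++)
            for (int i = 0; i < n; i++)
                images[t][i] = rank[i] >= t ? 1 : 0;
        return images;
    }

    public static void main(String[] args) {
        for (int[] image : binaryImages(new int[] {30, 10, 20}))
            System.out.println(Arrays.toString(image));
    }
}
```

Each image is a monotone function of the input values, so a data-oblivious compare-exchange sequence that sorts every image also sorts the input; the union bound over the $n + 1$ images is where the factor $n + 1$ in the lemma comes from.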

Note that this lemma is only of practical use for randomized data-oblivious algorithms that have failure probabilities of at most $O(n^{-\alpha})$, for some constant $\alpha > 1$. We refer to such algorithms as succeeding with very high probability. Fortunately, our analysis shows that our randomized Shellsort algorithm will $O(\mathrm{polylog}(n))$-near-sort a binary array with very high probability.

4.2 Bounding Dirtiness after each Iteration 

In the $d$-th iteration of our core algorithm, we partition the array $A$ into $2^d$ regions, $A_0, A_1, \ldots, A_{2^d-1}$, each of size $n/2^d$. Moreover, each iteration splits a region from the previous iteration into two equal-sized halves. Thus, the algorithm can be visualized in terms of a complete binary tree, $B$, with $n$ leaves. The root of $B$ corresponds to a region consisting of the entire array $A$ and each leaf of $B$ corresponds to an individual cell, $a_i$, in $A$, of size 1. Each internal node $v$ of $B$ at depth $d$ corresponds with a region, $A_i$, created in the $d$-th iteration of the algorithm, and the children of $v$ are associated with the two regions that $A_i$ is split into during the $(d+1)$-st iteration. (See Figure 3.)




Figure 3: The binary tree, B, and the distance of each region from the mixed region (shown in dark gray). 

The desired output, of course, is to have each leaf value, $a_i = 0$, for $i < n - k$, and $a_i = 1$, otherwise. We therefore refer to the transition from cell $n - k - 1$ to cell $n - k$ on the last level of $B$ as the crossover point.

2 A similar lemma is provided by Blackston and Ranade 0, but they omit the proof.

We refer to any leaf-level region to the left of the crossover point as a low region and any leaf-level region to the right of the crossover point as a high region. We say that a region, $A_i$, corresponding to an internal node $v$ of $B$, is a low region if all of $v$'s descendants are associated with low regions. Likewise, a region, $A_i$, corresponding to an internal node $v$ of $B$, is a high region if all of $v$'s descendants are associated with high regions. Thus, we desire that low regions eventually consist of only zeroes and high regions eventually consist of only ones. A region that is neither high nor low is mixed, since it is an ancestor of both low and high regions. Note that there are no mixed leaf-level regions, however.
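The low, high, and mixed categories can be computed directly from the cell range that a region covers. The Java sketch below is an illustration with hypothetical names (RegionKind, classify); it assumes that $n$ is a power of 2, that region $i$ at depth $d$ covers cells $i \cdot n/2^d$ through $(i+1) \cdot n/2^d - 1$, and that $k$ is the number of ones in the binary input.

```java
class RegionKind {
    enum Kind { LOW, HIGH, MIXED }

    // Classifies region i at depth d of the tree B for an n-cell binary array
    // with k ones. The crossover point lies between cells n - k - 1 and n - k.
    static Kind classify(int n, int k, int d, int i) {
        int size = n >> d;               // region size n / 2^d
        int first = i * size;            // leftmost cell of the region
        int last = first + size - 1;     // rightmost cell of the region
        if (last < n - k) return Kind.LOW;     // entirely left of the crossover
        if (first >= n - k) return Kind.HIGH;  // entirely right of the crossover
        return Kind.MIXED;                     // straddles the crossover point
    }

    public static void main(String[] args) {
        int n = 16, k = 5, d = 2;
        for (int i = 0; i < (1 << d); i++)
            System.out.println("A_" + i + ": " + classify(n, k, d, i));
    }
}
```

At the leaf level each region is a single cell, so the third branch is never taken there, which matches the remark that there are no mixed leaf-level regions.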

Also note that, since our randomized Shellsort algorithm is data-oblivious, the algorithm doesn't take any different behavior depending on whether a region is high, low, or mixed. Nevertheless, since the region-compare operation is w.v.h.p. a $(\mu, \alpha, \beta)$-leveraged-splitter, for each of the $(\mu, \alpha, \beta)$ tuples, $(c+1, 1/2, 0)$, $(c+1, (2e)^c, 4e\log n)$, and $(0, 1/6, 0)$, we can reason about the actions of our algorithm on different regions in terms of any one of these tuples.

With each high (resp., low) region, $A_i$, define the dirtiness of $A_i$ to be the number of zeroes (resp., ones) that are present in $A_i$, that is, values of the wrong type for $A_i$. With each region, $A_i$, we associate a dirtiness bound, $\delta(A_i)$, which is a desired upper bound on the dirtiness of $A_i$.

For each region, $A_i$, at depth $d$ in $B$, let $j$ be the number of regions between $A_i$ and the crossover point or mixed region on that level. That is, if $A_i$ is a low leaf-level region, then $j = n - k - i - 1$, and if $A_i$ is a high leaf-level region, then $j = i - n + k$. We define the desired dirtiness bound, $\delta(A_i)$, of $A_i$ as follows:



• If $j \ge 2$, then $\delta(A_i) = n/2^{d+j+3}$.

• If $j = 1$, then $\delta(A_i) = n/(5 \cdot 2^d)$.

• If $A_i$ is a mixed region, then $\delta(A_i) = |A_i|$.



Thus, every mixed region trivially satisfies its desired dirtiness bound. 
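The dirtiness bounds can be tabulated with a few lines of code. The Java sketch below is an illustration with hypothetical names (DirtinessBound, bound, isExtreme). It assumes the bounds as they are used later in the analysis, namely $n/2^{d+j+3}$ for $j \ge 2$, $n/(5 \cdot 2^d)$ for $j = 1$, and the region size for a mixed region, and it assumes that $\log$ denotes the base-2 logarithm in the extreme-region threshold $12e\log n$.

```java
class DirtinessBound {
    // Desired dirtiness bound delta(A_i) for a region at depth d that is
    // j regions away from the crossover point or mixed region on its level.
    static double bound(long n, int d, int j, boolean mixed) {
        if (mixed) return n / Math.pow(2, d);            // |A_i|, the region size
        if (j == 1) return n / (5.0 * Math.pow(2, d));
        if (j >= 2) return n / Math.pow(2, d + j + 3);
        // The case analysis in the text covers j = 1 and j >= 2 only.
        throw new IllegalArgumentException("j must be at least 1 for a non-mixed region");
    }

    // A non-mixed region is extreme when its bound drops below 12 e log n
    // (log taken base 2 here, which is an assumption of this sketch).
    static boolean isExtreme(long n, int d, int j) {
        double log2n = Math.log(n) / Math.log(2);
        return bound(n, d, j, false) < 12 * Math.E * log2n;
    }

    public static void main(String[] args) {
        long n = 1L << 20;
        for (int j = 1; j <= 8; j++)
            System.out.println("d=3, j=" + j + ": bound=" + bound(n, 3, j, false)
                + ", extreme=" + isExtreme(n, 3, j));
    }
}
```

For a fixed depth the bound halves with each additional region of distance from the crossover point, which is the source of the exponential tails in the dirtiness histogram.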

Because of our need for a high probability bound, we will guarantee that each region $A_i$ satisfies its desired dirtiness bound, w.v.h.p., only if $\delta(A_i) \ge 12e\log n$. If $\delta(A_i) < 12e\log n$, then we say $A_i$ is an extreme region, for, during our algorithm, this condition implies that $A_i$ is relatively far from the crossover point. (Please see Figure 4 for an illustration of the "zones of order" that are defined by the low, high, mixed, and extreme regions in $A$.)



[Histogram with labeled zones, from left to right: extreme regions, low regions, mixed region, high regions, extreme regions.]


Figure 4: An example histogram of the dirtiness of the different kinds of regions, as categorized by the 
analysis of the randomized Shellsort algorithm. By the inductive claim, the distribution of dirtiness has 
exponential tails with polylogarithmic ends. 

We will show that the total dirtiness of all extreme regions is $O(\log^3 n)$ w.v.h.p. Thus, we can terminate our analysis when the number and size of the non-extreme regions is polylog($n$), at which point the array $A$ will be $O(\mathrm{polylog}(n))$-near-sorted w.v.h.p. Throughout this analysis, we make repeated use of the following simple but useful lemma.

Lemma 4.2 Suppose $A_i$ is a low (resp., high) region and $\Delta$ is the cumulative dirtiness of all regions to the left (resp., right) of $A_i$. Then any region compare-exchange pass over $A$ can increase the dirtiness of $A_i$ by at most $\Delta$.

Proof: If $A_i$ is a low (resp., high) region, then its dirtiness is measured by the number of ones (resp., zeroes) it contains. During any region compare-exchange pass, ones can only move right, exchanging themselves with zeroes, and zeroes can only move left, exchanging themselves with ones. Thus, the only ones that can move into a low region are those to the left of it and the only zeroes that can move into a high region are those to the right of it. ■



4.3 An Inductive Argument 

The inductive claim we wish to show holds with very high probability is the following. 

Claim 4.3 After iteration $d$, for each region $A_i$, the dirtiness of $A_i$ is at most $\delta(A_i)$, provided $A_i$ is not extreme. The total dirtiness of all extreme regions is at most $12ed\log^2 n$.

Let us begin at the point when the algorithm creates the first two regions, $A_1$ and $A_2$. Suppose that $k \le n - k$, where $k$ is the number of ones, so that $A_1$ is a low region and $A_2$ is either a high region (i.e., if $k = n - k$) or $A_2$ is mixed (the case when $k > n - k$ is symmetric). Let $k_1$ (resp., $k_2$) denote the number of ones in $A_1$ (resp., $A_2$), so $k = k_1 + k_2$. By the Startup Lemma (3.7), the dirtiness of $A_1$ will be at most $n/12$, with very high probability, since the region compare-exchange operation is a $(0, 1/6, 0)$-leveraged-splitter. Note that this satisfies the desired dirtiness of $A_1$, since $\delta(A_1) = n/10$ in this case. A similar argument applies to $A_2$ if it is a high region, and if $A_2$ is mixed, it trivially satisfies its desired dirtiness bound. Also, assuming $n$ is large enough, there are no extreme regions (if $n$ is so small that $A_1$ is extreme, we can immediately switch to the postprocessing cleanup phase). Thus, we satisfy the base case of our inductive argument: the dirtiness bounds for the two children of the root of $B$ are satisfied with (very) high probability, and similar arguments prove the inductive claim for iterations 2 and 3.

Let us now consider a general inductive step. Let us assume that, with very high probability, we have satisfied Claim 4.3 for the regions on level $d \ge 3$ and let us now consider the transition to level $d + 1$. In addition, we terminate this line of reasoning when the region size, $n/2^d$, becomes less than $16e^2\log^6 n$, at which point $A$ will be $O(\mathrm{polylog}(n))$-near-sorted, with very high probability, by Claim 4.3 and Lemma 4.1.



4.3.1 Extreme Regions 

Let us begin with the bound for the dirtiness of extreme regions in iteration $d+1$. Note that, by Lemma 4.2, regions that were extreme after iteration $d$ will be split into regions in iteration $d+1$ that contribute no new amounts of dirtiness to pre-existing extreme regions. That is, extreme regions get split into extreme regions. Thus, the new dirtiness for extreme regions can come only from regions that were not extreme after iteration $d$ that are now splitting into extreme regions in iteration $d+1$, which we call freshly extreme regions. Suppose, then, that $A_i$ is such a region, say, with a parent, $A_p$, which is $j$ regions from the mixed region on level $d$. Then the desired dirtiness bound of $A_i$'s parent region, $A_p$, is $\delta(A_p) = n/2^{d+j+3} \ge 12e\log n$, by Claim 4.3, since $A_p$ is not extreme. $A_p$ has (low-region) children, $A_i$ and $A_{i+1}$, that have desired dirtiness bounds of $\delta(A_i) = n/2^{d+1+2j+4}$ or $\delta(A_i) = n/2^{d+1+2j+3}$ and of $\delta(A_{i+1}) = n/2^{d+1+2j+3}$ or $\delta(A_{i+1}) = n/2^{d+1+2j+2}$, depending on whether the mixed region on level $d+1$ has an odd or even index. Moreover, $A_i$ (and possibly $A_{i+1}$) is freshly extreme, so $n/2^{d+1+2j+4} < 12e\log n$, which implies that $j > (\log n - d - \log\log n - 10)/2$. Nevertheless, note also that there are $O(\log n)$ new regions on this level that are just now becoming extreme, since $n/2^d > 16e^2\log^6 n$ and $n/2^{d+j+3} \ge 12e\log n$ implies $j \le \log n - d$. So let us consider the two new regions, $A_i$ and $A_{i+1}$, in turn, and how the shaker pass affects them (for after that they will collectively satisfy the extreme-region part of Claim 4.3).

Region $A_i$: Consider the worst case for $\delta(A_i)$, namely, that $\delta(A_i) = n/2^{d+1+2j+4}$. Since $A_i$ is a left child of $A_p$, $A_i$ could get at most $n/2^{d+j+3} + 12ed\log^2 n$ ones from regions left of $A_i$, by Lemma 4.2. In addition, $A_i$ and $A_{i+1}$ could inherit at most $\delta(A_p) = n/2^{d+j+3}$ ones from $A_p$. Thus, if we let $N$ denote the size of $A_i$, i.e., $N = n/2^{d+1}$, then $A_i$ and $A_{i+1}$ together have at most $N/2^{j+1} + 3N^{1/2} \le N/2^j$ ones, since we stop the induction when $N < 16e^2\log^6 n$. By the Little-Region Lemma (3.5), the following condition holds with probability at least $1 - cn^{-4}$,

\[ k_1^{(c)} \le \max\{(2e)^c(1-\epsilon)^{c+1}N,\ 4e\log n\}, \]

where $k_1^{(c)}$ is the number of ones left in $A_i$ after $c$ region compare-exchanges with $A_{i+1}$, since the region compare-exchange operation is a $(c+1, (2e)^c, 4e\log n)$-leveraged-splitter. Note that, if $k_1^{(c)} \le 4e\log n$, then we have satisfied the desired dirtiness for $A_i$. Alternatively, so long as $c \ge 4$, and $j \ge 5$, then w.v.h.p.,

\[ k_1^{(c)} \le (2e)^c(1-\epsilon)^{c+1}N \le \frac{(2e)^c n}{2^{d+1+j(c+1)}} \le \frac{n}{2^{d+1+2j+4}} = \delta(A_i) < 12e\log n. \]

Region $A_{i+1}$: Consider the worst case for $\delta(A_{i+1})$, namely $\delta(A_{i+1}) = n/2^{d+1+2j+3}$. Since $A_{i+1}$ is a right child of $A_p$, $A_{i+1}$ could get at most $n/2^{d+j+3} + 12ed\log^2 n$ ones from regions left of $A_{i+1}$, by Lemma 4.2, plus $A_{i+1}$ could inherit at most $\delta(A_p) = n/2^{d+j+3}$ ones from $A_p$ itself. In addition, since $j > 2$, $A_{i+2}$ could inherit at most $n/2^{d+j+2}$ ones from its parent. Thus, if we let $N$ denote the size of $A_{i+1}$, i.e., $N = n/2^{d+1}$, then $A_{i+1}$ and $A_{i+2}$ together have at most $N/2^j + 3N^{1/2} \le N/2^{j-1}$ ones, since we stop the induction when $N < 16e^2\log^6 n$. By the Little-Region Lemma (3.5), the following condition holds with probability at least $1 - cn^{-4}$,

\[ k_1^{(c)} \le \max\{(2e)^c(1-\epsilon)^{c+1}N,\ 4e\log n\}, \]

where $k_1^{(c)}$ is the number of ones left in $A_{i+1}$ after $c$ region compare-exchange operations, since the region compare-exchange operation is a $(c+1, (2e)^c, 4e\log n)$-leveraged-splitter. Note that, if $k_1^{(c)} \le 4e\log n$, then we have satisfied the desired dirtiness bound for $A_{i+1}$. Alternatively, so long as $c \ge 4$, and $j \ge 6$,

\[ k_1^{(c)} \le (2e)^c(1-\epsilon)^{c+1}N \le \frac{(2e)^c n}{2^{d+1+(j-1)(c+1)}} \le \frac{n}{2^{d+1+2j+3}} = \delta(A_{i+1}) < 12e\log n. \]

Therefore, if a low region $A_i$ or $A_{i+1}$ becomes freshly extreme in iteration $d+1$, then, w.v.h.p., its dirtiness is at most $12e\log n$. Since there are at most $\log n$ freshly extreme regions created in iteration $d+1$, this implies that the total dirtiness of all extreme low regions in iteration $d+1$ is at most $12e(d+1)\log^2 n$, w.v.h.p., after the right-moving shaker pass, by Claim 4.3. Likewise, by symmetry, a similar claim applies to the high regions after the left-moving shaker pass. Moreover, by Lemma 4.2, these extreme regions will continue to satisfy Claim 4.3 after this.



4.3.2 Non-extreme Regions not too Close to the Crossover Point 

Let us now consider non-extreme regions on level $d+1$ that are at least two regions away from the crossover point on level $d+1$. Consider, wlog, a low region, $A_p$, on level $d$, which is $j$ regions from the crossover point on level $d$, with $A_p$ having (low-region) children, $A_i$ and $A_{i+1}$, that have desired dirtiness bounds of $\delta(A_i) = n/2^{d+1+2j+4}$ or $\delta(A_i) = n/2^{d+1+2j+3}$ and of $\delta(A_{i+1}) = n/2^{d+1+2j+3}$ or $\delta(A_{i+1}) = n/2^{d+1+2j+2}$, depending on whether the mixed region on level $d+1$ has an odd or even index. By Lemma 4.2, if we can show w.v.h.p. that the dirtiness of each such $A_i$ (resp., $A_{i+1}$) is at most $\delta(A_i)/3$ (resp., $\delta(A_{i+1})/3$), after the shaker pass, then no matter how many more ones come into $A_i$ or $A_{i+1}$ from the left during the rest of iteration $d+1$, they will satisfy their desired dirtiness bounds.

Let us consider the different region types (always taking the most difficult choice for each desired dirtiness in order to avoid additional cases):

• Type 1: $\delta(A_i) = n/2^{d+1+2j+4}$, with $j \ge 2$. Since $A_i$ is a left child of $A_p$, $A_i$ could get at most $n/2^{d+j+3} + 12ed\log^2 n$ ones from regions left of $A_i$, by Lemma 4.2. In addition, $A_i$ and $A_{i+1}$ could inherit at most $\delta(A_p) = n/2^{d+j+3}$ ones from $A_p$. Thus, if we let $N$ denote the size of $A_i$, i.e., $N = n/2^{d+1}$, then $A_i$ and $A_{i+1}$ together have at most $N/2^{j+1} + 3N^{1/2} \le N/2^j$ ones, since we stop the induction when $N < 16e^2\log^6 n$. If $(1-\epsilon)^{c+1}N/2^{10}\ln(2c-1) \ge 4\ln n$, then, by the Fast-Depletion Lemma (3.3), the following condition holds with probability at least $1 - n^{-4}$, provided $c \ge 4$:

\[ k_1^{(2c-1)} \le \frac{(1-\epsilon)^{c+1}N}{2} \le \frac{n}{2^{d+1+j(c+1)+1}} \le \frac{n}{3 \cdot 2^{d+1+2j+4}} = \delta(A_i)/3, \]

where $k_1^{(2c-1)}$ is the number of ones left in $A_i$ after $c$ region compare-exchange operations, since the region compare-exchange operation is a $(c+1, 1/2, 0)$-leveraged-splitter. If, on the other hand, $(1-\epsilon)^{c+1}N/2^{10}\ln(2c-1) < 4\ln n$, then $j$ is $\Omega(\log\log n)$, so we can assume $j \ge 6$, and, by the Little-Region Lemma (3.5), the following condition holds with probability at least $1 - cn^{-4}$ in this case:

\[ k_1^{(c)} \le \max\{(2e)^c(1-\epsilon)^{c+1}N,\ 4e\log n\}, \]

since the region compare-exchange operation is a $(c+1, (2e)^c, 4e\log n)$-leveraged-splitter. Note that, since $A_i$ is not extreme, if $k_1^{(c)} \le 4e\log n$, then $k_1^{(c)} \le \delta(A_i)/3$. Alternatively, so long as $c \ge 4$, then, w.v.h.p.,

\[ k_1^{(c)} \le (2e)^c(1-\epsilon)^{c+1}N \le \frac{(2e)^c n}{2^{d+1+j(c+1)}} \le \frac{n}{3 \cdot 2^{d+1+2j+4}} = \delta(A_i)/3. \]

• Type 2: $\delta(A_{i+1}) = n/2^{d+1+2j+3}$, with $j > 2$. Since $A_{i+1}$ is a right child of $A_p$, $A_{i+1}$ could get at most $n/2^{d+j+3} + 12ed\log^2 n$ ones from regions left of $A_{i+1}$, by Lemma 4.2, plus $A_{i+1}$ could inherit at most $\delta(A_p) = n/2^{d+j+3}$ ones from $A_p$. In addition, since $j > 2$, $A_{i+2}$ could inherit at most $n/2^{d+j+2}$ ones from its parent. Thus, if we let $N$ denote the size of $A_{i+1}$, i.e., $N = n/2^{d+1}$, then $A_{i+1}$ and $A_{i+2}$ together have at most $N/2^j + 3N^{1/2} \le N/2^{j-1}$ ones, since we stop the induction when $N < 16e^2\log^6 n$. If $(1-\epsilon)^{c+1}N/2^{10}\ln(2c-1) \ge 4\ln n$, then, by the Fast-Depletion Lemma (3.3), the following condition holds with probability at least $1 - n^{-4}$, for a suitably-chosen constant $c$,

\[ k_1^{(2c-1)} \le \frac{(1-\epsilon)^{c+1}N}{2} \le \frac{n}{2^{d+1+(j-1)(c+1)+1}} \le \frac{n}{3 \cdot 2^{d+1+2j+3}} = \delta(A_{i+1})/3, \]

where $k_1^{(2c-1)}$ is the number of ones left in $A_{i+1}$ after $c$ region compare-exchange operations. If, on the other hand, $(1-\epsilon)^{c+1}N/2^{10}\ln(2c-1) < 4\ln n$, then $j$ is $\Omega(\log\log n)$, so we can now assume $j \ge 6$, and, by the Little-Region Lemma (3.5), the following condition holds with probability at least $1 - cn^{-4}$:

\[ k_1^{(c)} \le \max\{(2e)^c(1-\epsilon)^{c+1}N,\ 4e\log n\}. \]

Note that, since $A_{i+1}$ is not extreme, if $k_1^{(c)} \le 4e\log n$, then $k_1^{(c)} \le \delta(A_{i+1})/3$. Thus, we can choose constant $c$ so that

\[ k_1^{(c)} \le (2e)^c(1-\epsilon)^{c+1}N \le \frac{(2e)^c n}{2^{d+1+(j-1)(c+1)}} \le \frac{n}{3 \cdot 2^{d+1+2j+3}} = \delta(A_{i+1})/3. \]


• Type 3: $\delta(A_{i+1}) = n/2^{d+1+2j+3}$, with $j = 2$. Since $A_{i+1}$ is a right child of $A_p$, $A_{i+1}$ could get at most $n/2^{d+j+3} + 12ed\log^2 n$ ones from regions left of $A_{i+1}$, by Lemma 4.2, plus $A_{i+1}$ could inherit at most $\delta(A_p) = n/2^{d+j+3}$ ones from $A_p$. In addition, since $j = 2$, $A_{i+2}$ could inherit at most $n/(5 \cdot 2^d)$ ones from its parent. Thus, if we let $N$ denote the size of $A_{i+1}$, i.e., $N = n/2^{d+1}$, then $A_{i+1}$ and $A_{i+2}$ together have at most $N/2^{j+1} + 2N/5 + 3N^{1/2} \le 3N/5$ ones, since we stop the induction when $N < 16e^2\log^6 n$. In addition, note that this also implies that as long as $c$ is a constant, $(1-\epsilon)^{c+1}N/2^{10}\ln(2c-1) \ge 4\ln n$. Thus, by the Fast-Depletion Lemma (3.3), we can choose constant $c$ so that the following condition holds with probability at least $1 - n^{-4}$:

\[ k_1^{(2c-1)} \le \frac{(1-\epsilon)^{c+1}N}{2} \le \frac{3^{c+1} n}{5^{c+1} 2^{d+2}} \le \frac{n}{3 \cdot 2^{d+1+2j+3}} = \delta(A_{i+1})/3, \]

where $k_1^{(2c-1)}$ is the number of ones left in $A_{i+1}$ after $c$ region compare-exchange operations.

• Type 4: $\delta(A_i) = n/2^{d+1+2j+4}$, with $j = 1$. Since $A_i$ is a left child of $A_p$, $A_i$ could get at most $n/2^{d+j+2} + 12ed\log^2 n$ ones from regions left of $A_i$, by Lemma 4.2, plus $A_i$ and $A_{i+1}$ could inherit at most $\delta(A_p) = n/(5 \cdot 2^d)$ ones from $A_p$. Thus, if we let $N$ denote the size of $A_i$, i.e., $N = n/2^{d+1}$, then $A_i$ and $A_{i+1}$ together have at most $N/2^{j+1} + 2N/5 + 3N^{1/2} \le 7N/10$ ones, since we stop the induction when $N < 16e^2\log^6 n$. In addition, note that this also implies that as long as $c$ is a constant, $(1-\epsilon)^{c+1}N/2^{10}\ln(2c-1) \ge 4\ln n$. Thus, by the Fast-Depletion Lemma (3.3), the following condition holds with probability at least $1 - n^{-4}$, for a suitably-chosen constant $c$,

\[ k_1^{(2c-1)} \le \frac{(1-\epsilon)^{c+1}N}{2} \le \frac{7^{c+1} n}{10^{c+1} 2^{d+2}} \le \frac{n}{3 \cdot 2^{d+1+2j+4}} = \delta(A_i)/3, \]

where $k_1^{(2c-1)}$ is the number of ones left in $A_i$ after $c$ region compare-exchange operations.

Thus, $A_i$ and $A_{i+1}$ satisfy their respective desired dirtiness bounds w.v.h.p., provided they are at least two regions from the mixed region or crossover point.

4.3.3 Regions near the Crossover Point 

Consider now regions near the crossover point, that is, each region with a parent that is mixed, bordering the crossover point, or next to a region that either contains or borders the crossover point. Let us focus specifically on the case when there is a mixed region on levels $d$ and $d+1$, as it is the most difficult of these scenarios.

So, having dealt with all the other regions, which have their desired dirtiness satisfied after the shaker pass, we are left with four regions near the crossover point, which we will refer to as $A_1$, $A_2$, $A_3$, and $A_4$. One of $A_2$ or $A_3$ is mixed; without loss of generality, let us assume $A_3$ is mixed. At this point in the algorithm, we perform a brick-type pass, which, from the perspective of these four regions, amounts to a complete 4-tournament. Note that, by the results of the shaker pass (which were proved above), we have at this point pushed to these four regions all but at most $n/2^{d+7} + 12e(d+1)\log^2 n$ of the ones and all but at most $n/2^{d+6} + 12e(d+1)\log^2 n$ of the zeroes. Moreover, these bounds will continue to hold (and could even improve) as we perform the different steps of the brick-type pass. Thus, at the beginning of the 4-tournament for these four regions, we know that the four regions hold between $2N - N/32 - 3N^{1/2}$ and $3N + N/64 + 3N^{1/2}$ zeroes and between $N - N/64 - 3N^{1/2}$ and $2N + N/32 + 3N^{1/2}$ ones, where $N = n/2^{d+1} > 16e^2\log^6 n$. For each region compare-exchange operation, we distinguish three possible outcomes:



• balanced: $A_i$ and $A_{i+j}$ have between $31N/32$ and $33N/32$ zeroes (and ones). In this case, the Startup Lemma (3.7) implies that $A_i$ will get at least $31N/32 - N/6$ zeroes and at most $N/32 + N/6$ ones, and $A_{i+j}$ will get at least $31N/32 - N/6$ ones and at most $N/32 + N/6$ zeroes, w.v.h.p.

• 0-heavy: $A_i$ and $A_{i+j}$ have at least $33N/32$ zeroes. In this case³, by the Slow-Depletion Lemma (3.9), $A_i$ will get at most $N/20$ ones, w.v.h.p., with appropriate choice for $c$.

• 1-heavy: $A_i$ and $A_{i+j}$ have at least $33N/32$ ones. In this case, by the Slow-Depletion Lemma (3.9), $A_{i+j}$ will get at most $N/20$ zeroes, w.v.h.p., with appropriate choice for $c$.

3 The constant factor can be improved somewhat by first applying the Startup Lemma and then applying the Slow-Depletion Lemma.

Let us focus on the four regions, $A_1$, $A_2$, $A_3$, and $A_4$, and consider the region compare-exchange operations that each region participates in as a part of the 4-tournament for these four.

• $A_1$: this region is compared to $A_4$, $A_3$, and $A_2$, in this order. If the first of these is 0-heavy, then we already will satisfy $A_1$'s desired dirtiness bound (which can only improve after this). If the first of these comparisons is balanced, on the other hand, then $A_1$ ends up with at least $31N/32 - N/6 \approx 0.802N$ zeroes (and $A_4$ will have at most $N/32 + N/6 \approx 0.198N$). Since there are at least $2N - N/32 - 3N^{1/2} \approx 1.9N$ zeroes distributed among the four regions, this forces one of the comparisons with $A_3$ or $A_2$ to be 0-heavy, which will cause $A_1$ to satisfy its desired dirtiness.

• $A_2$: this region is compared to $A_4$, $A_1$, and $A_3$, in this order. Note, therefore, that it does its comparisons with $A_1$ and $A_3$ after $A_1$. But even if $A_1$ receives $N$ zeroes, there are still at least $31N/32 - 3N^{1/2}$ zeroes that would be left. Thus, even under this worst-case scenario (from $A_2$'s perspective), the comparisons with $A_3$ and $A_4$ will be either balanced or 1-heavy. If one of them is balanced (and even if $A_1$ is full of zeroes), then $A_2$ gets at least $31N/32 - N/6 \approx 0.802N$ zeroes. If they are both 1-heavy, then $A_3$ and $A_4$ end up with at most $N/20$ zeroes each, which leaves $A_2$ with at least $31N/32 - N/10 \approx 0.869N$ zeroes, w.v.h.p.

• $A_3$: by assumption, $A_3$ is mixed, so it automatically satisfies its desired dirtiness bound.

• $A_4$: this region is compared to $A_1$, $A_2$, and $A_3$, in this order. If any of these is balanced or 1-heavy, then we satisfy the desired dirtiness bound for $A_4$. If they are all 0-heavy, then each of them ends up with at most $N/20$ ones, which implies that $A_4$ ends up with at least $N - N/64 - 3N/20 - 3N^{1/2} \approx 0.81N$ ones, w.v.h.p., which also satisfies the desired dirtiness bound for $A_4$.

Thus, after the brick-type pass of iteration $d+1$, we will have satisfied Claim 4.3 w.v.h.p. In particular, we have proved that each region satisfies Claim 4.3 after iteration $d+1$ with a failure probability of at most $O(n^{-4})$, for each region compare-exchange operation we perform. Thus, since there are $O(n)$ such regions per iteration, this implies any iteration will fail with probability at most $O(n^{-3})$. Therefore, since there are $O(\log n)$ iterations, and we lose only an $O(n)$ factor in our failure probability when we apply the probabilistic zero-one principle (Lemma 4.1), when we complete the first phase of our randomized Shellsort algorithm, the array $A$ will be $O(\mathrm{polylog}(n))$-near-sorted w.v.h.p., in which case the postprocessing step will complete the sorting of $A$.
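The failure-probability accounting in the preceding paragraph amounts to three successive union bounds. Written out as a sketch, with the constants inside the $O$-notation left unspecified as in the text:

```latex
\[
\underbrace{O(n)}_{\text{regions per iteration}} \cdot O(n^{-4}) = O(n^{-3}),
\qquad
\underbrace{O(\log n)}_{\text{iterations}} \cdot O(n^{-3}) = O(n^{-3}\log n),
\qquad
\underbrace{(n+1)}_{\text{Lemma 4.1}} \cdot O(n^{-3}\log n) = O(n^{-2}\log n).
\]
```

The final quantity is $O(n^{-\alpha})$ for a constant $\alpha > 1$, which is the sense of very high probability required for Lemma 4.1 to be of practical use.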



5 Implementation and Experiments 

As an existence proof for its ease of implementation, we provide a complete Java program for randomized
Shellsort in Figure 5.

Given this implementation, we explored empirically the degree to which the success of the algorithm 
depends on the constant c, which indicates the number of times to perform random matchings in a region 
compare-exchange operation. We began with c = 1, with the intention of progressively increasing c until we 
determined the value of c that would lead to failure rate of at most 0.1% in practice. Interestingly, however, 
c = 1 already achieved over a 99.9% success rate in all our experiments. 

So, rather than incrementing c, we instead kept c = 1 and tested the degree to which the different parts of the brick-type pass were necessary, since previous experimental work exists for shaker passes [9, 24, 23, 48]. The first experiment tested the failure percentages of 10,000 runs of randomized Shellsort on random inputs of various sizes, optionally omitting the various parts of the brick pass, keeping c = 1 for region compare-exchange operations, and always doing the shaker pass. The failure rates were as follows:






import java.util.*;
public class ShellSort {
  public static final int C=4; // number of region compare-exchange repetitions
  public static void exchange(int[] a, int i, int j) {
    int temp = a[i];
    a[i] = a[j];
    a[j] = temp;
  }
  public static void compareExchange(int[] a, int i, int j) {
    if (((i < j) && (a[i] > a[j])) || ((i > j) && (a[i] < a[j])))
      exchange(a, i, j);
  }
  public static void permuteRandom(int a[], MyRandom rand) {
    for (int i=0; i<a.length; i++) // Use the Knuth random perm. algorithm
      exchange(a, i, rand.nextInt(a.length-i)+i);
  }
  // compare-exchange two regions of length offset each
  public static void compareRegions(int[] a, int s, int t, int offset, MyRandom rand) {
    int mate[] = new int[offset]; // index offset array
    for (int count=0; count<C; count++) { // do C region compare-exchanges
      for (int i=0; i<offset; i++) mate[i] = i;
      permuteRandom(mate,rand); // comment this out to get a deterministic Shellsort
      for (int i=0; i<offset; i++)
        compareExchange(a, s+i, t+mate[i]);
    }
  }
  public static void randomizedShellSort(int[] a) {
    int n = a.length; // we assume that n is a power of 2
    MyRandom rand = new MyRandom(); // random number generator (not shown)
    for (int offset = n/2; offset > 0; offset /= 2) {
      for (int i=0; i < n-offset; i += offset) // compare-exchange up
        compareRegions(a,i,i+offset,offset,rand);
      for (int i=n-offset; i >= offset; i -= offset) // compare-exchange down
        compareRegions(a,i-offset,i,offset,rand);
      for (int i=0; i < n-3*offset; i += offset) // compare 3 hops up
        compareRegions(a,i,i+3*offset,offset,rand);
      for (int i=0; i < n-2*offset; i += offset) // compare 2 hops up
        compareRegions(a,i,i+2*offset,offset,rand);
      for (int i=0; i < n; i += 2*offset) // compare odd-even regions
        compareRegions(a,i,i+offset,offset,rand);
      for (int i=offset; i < n-offset; i += 2*offset) // compare even-odd regions
        compareRegions(a,i,i+offset,offset,rand);
    }
  }
}



Figure 5: Our randomized Shellsort algorithm in Java. Note that, just by commenting out the call to
permuteRandom, on line 22, in compareRegions, this becomes a deterministic Shellsort implementation.

n          no brick pass   no short jumps   no long jumps
128        68.18%          33.92%           0.01%
256        93.27%          60.11%           0%
512        99.86%          85.62%           0%
1024       100.00%         98.27%           0.01%
2048       100.00%         99.98%           0.03%
4096       100.00%         100.00%          0.17%
8192       100.00%         100.00%          0.14%
16384      100.00%         100.00%          0.35%
32768      100.00%         100.00%          0.71%
65536      100.00%         100.00%          1.53%
131072     100.00%         100.00%          2.55%
262144     100.00%         100.00%          5.29%
524288     100.00%         100.00%          10.88%
1048576    100.00%         100.00%          21.91%

With the full pass, the failure rate was 0% at every size except for a single 0.01% entry at either n = 262144 or n = 524288.


Thus, the need for brick-type passes when c = 1 is established empirically from this experiment, with a 
particular need for the short jumps (i.e., the ones between adjacent regions), but with long jumps still being 
important. 
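An ablation of this kind could be set up along the following lines. This is a sketch under stated assumptions, not the test code used for the experiments: the class name BrickPassExperiment, the method countFailures, and the two boolean flags are hypothetical, java.util.Random stands in for MyRandom, and "long jumps" is taken to mean the 3-hop and 2-hop passes (an inference from the description of short jumps as the ones between adjacent regions). Each region compare-exchange is performed once, matching c = 1.

```java
import java.util.Random;

class BrickPassExperiment {
    // Order a[i], a[j] so that the smaller value sits at the smaller index.
    static void compareExchange(int[] a, int i, int j) {
        if ((i < j && a[i] > a[j]) || (i > j && a[i] < a[j])) {
            int t = a[i]; a[i] = a[j]; a[j] = t;
        }
    }

    // Uniform random shuffle, written the same way as permuteRandom.
    static void shuffle(int[] a, Random rand) {
        for (int i = 0; i < a.length; i++) {
            int j = rand.nextInt(a.length - i) + i;
            int t = a[i]; a[i] = a[j]; a[j] = t;
        }
    }

    // A single region compare-exchange (c = 1) along a random matching.
    static void compareRegions(int[] a, int s, int t, int offset, Random rand) {
        int[] mate = new int[offset];
        for (int i = 0; i < offset; i++) mate[i] = i;
        shuffle(mate, rand);
        for (int i = 0; i < offset; i++) compareExchange(a, s + i, t + mate[i]);
    }

    // Assumed grouping: long jumps = 3-hop and 2-hop passes,
    // short jumps = the odd-even and even-odd passes on adjacent regions.
    static void sort(int[] a, Random rand, boolean longJumps, boolean shortJumps) {
        int n = a.length; // assumed to be a power of 2
        for (int offset = n / 2; offset > 0; offset /= 2) {
            for (int i = 0; i < n - offset; i += offset)        // shaker pass, up
                compareRegions(a, i, i + offset, offset, rand);
            for (int i = n - offset; i >= offset; i -= offset)  // shaker pass, down
                compareRegions(a, i - offset, i, offset, rand);
            if (longJumps) {
                for (int i = 0; i < n - 3 * offset; i += offset)
                    compareRegions(a, i, i + 3 * offset, offset, rand);
                for (int i = 0; i < n - 2 * offset; i += offset)
                    compareRegions(a, i, i + 2 * offset, offset, rand);
            }
            if (shortJumps) {
                for (int i = 0; i < n; i += 2 * offset)
                    compareRegions(a, i, i + offset, offset, rand);
                for (int i = offset; i < n - offset; i += 2 * offset)
                    compareRegions(a, i, i + offset, offset, rand);
            }
        }
    }

    // Number of runs, out of trials, that leave a random permutation of 0..n-1 unsorted.
    static int countFailures(int n, int trials, long seed, boolean longJumps, boolean shortJumps) {
        Random rand = new Random(seed);
        int failures = 0;
        for (int r = 0; r < trials; r++) {
            int[] a = new int[n];
            for (int i = 0; i < n; i++) a[i] = i;
            shuffle(a, rand);
            sort(a, rand, longJumps, shortJumps);
            for (int i = 0; i < n; i++) {
                if (a[i] != i) { failures++; break; }
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        int n = 256, trials = 1000; // illustrative sizes only
        System.out.println("no brick pass:  " + countFailures(n, trials, 1L, false, false));
        System.out.println("no short jumps: " + countFailures(n, trials, 1L, true, false));
        System.out.println("no long jumps:  " + countFailures(n, trials, 1L, false, true));
        System.out.println("full pass:      " + countFailures(n, trials, 1L, true, true));
    }
}
```

Dividing each count by the number of trials gives a failure percentage comparable in kind to the ones tabulated above. The exact figures depend on the random generator and seed, so they should not be expected to match the reported values digit for digit.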

We next continued the experiment on larger arrays, testing 1,000 runs of randomized Shellsort on random
inputs of various sizes and tabulating the failure percentages when performing short jumps only and when
performing full brick passes. The failure rates were as follows:



6 Conclusion and Open Problems 

We have given a simple, randomized Shellsort algorithm that runs in O(n log n) time and sorts any given
input permutation with very high probability. This algorithm can alternatively be viewed as a randomized
construction of a simple compare-exchange network that has O(n log n) size and sorts with very high probability.
Its depth is not as asymptotically shallow as that of the AKS sorting network [T] and its improvements [3"6"ll45|,
but its constant factors are much smaller and it is quite simple, making it an alternative to the randomized
sorting-network construction of Leighton and Plaxton [29]. Some open questions and directions for future
work include the following:

• For what values of μ, α, and β can one deterministically and effectively construct
(μ, α, β)-leveraged-splitters?

• Is there a simple deterministic O(n log n)-sized sorting network?

• Can the randomness needed for a randomized Shellsort algorithm be reduced to a polylogarithmic 
number of bits while retaining a very high probability of sorting? 

• Can the shaker pass in our randomized Shellsort algorithm be replaced by a lower-depth network,
thereby achieving polylogarithmic depth while keeping the overall O(n log n) size and very high
probability of sorting?

• Can the constant factors in the running time for a randomized Shellsort algorithm be reduced to be
at most 2 while still maintaining the overall O(n log n) size and very high probability of sorting?
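The compare-exchange-network view mentioned in this section can be made concrete by generating the comparator list first and applying it to the data afterwards. The sketch below does this under stated assumptions: the class ShellsortNetwork and its methods build, addRegion, and apply are hypothetical names, java.util.Random stands in for MyRandom, and the repetition count c is passed in as a parameter rather than taken from the paper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class ShellsortNetwork {
    // Append c region compare-exchanges between positions s..s+offset-1 and
    // t..t+offset-1, each along a random matching, stored as {low, high} pairs.
    static void addRegion(List<int[]> net, int s, int t, int offset, int c, Random rand) {
        int[] mate = new int[offset];
        for (int count = 0; count < c; count++) {
            for (int i = 0; i < offset; i++) mate[i] = i;
            for (int i = 0; i < offset; i++) {   // shuffle, as in permuteRandom
                int j = rand.nextInt(offset - i) + i;
                int tmp = mate[i]; mate[i] = mate[j]; mate[j] = tmp;
            }
            for (int i = 0; i < offset; i++)
                net.add(new int[] { s + i, t + mate[i] });
        }
    }

    // Build the comparator list for n inputs (n assumed to be a power of 2).
    static List<int[]> build(int n, int c, Random rand) {
        List<int[]> net = new ArrayList<>();
        for (int offset = n / 2; offset > 0; offset /= 2) {
            for (int i = 0; i < n - offset; i += offset)
                addRegion(net, i, i + offset, offset, c, rand);
            for (int i = n - offset; i >= offset; i -= offset)
                addRegion(net, i - offset, i, offset, c, rand);
            for (int i = 0; i < n - 3 * offset; i += offset)
                addRegion(net, i, i + 3 * offset, offset, c, rand);
            for (int i = 0; i < n - 2 * offset; i += offset)
                addRegion(net, i, i + 2 * offset, offset, c, rand);
            for (int i = 0; i < n; i += 2 * offset)
                addRegion(net, i, i + offset, offset, c, rand);
            for (int i = offset; i < n - offset; i += 2 * offset)
                addRegion(net, i, i + offset, offset, c, rand);
        }
        return net;
    }

    // Apply the fixed comparator list; the sequence of compared positions
    // was chosen before looking at the data, so it cannot depend on it.
    static void apply(List<int[]> net, int[] a) {
        for (int[] cmp : net) {
            int lo = cmp[0], hi = cmp[1];   // lo < hi by construction
            if (a[lo] > a[hi]) { int t = a[lo]; a[lo] = a[hi]; a[hi] = t; }
        }
    }

    public static void main(String[] args) {
        List<int[]> net = build(1024, 4, new Random(1L)); // c = 4 is an assumed setting
        System.out.println("comparators: " + net.size());
    }
}
```

Separating build from apply makes the data-oblivious structure explicit: the same list can be applied to any input of length n, and its length is the network size that the O(n log n) bound refers to.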